
Revert "[MoE Refactor] Remove MoE DP chunking" (#39107) #39853

Draft

vllm-agent wants to merge 1 commit into vllm-project:main from vllm-agent:auto-revert/pr-39107

Conversation

@vllm-agent

Revert of #39107

This reverts commit e1e318a.

Reason: This PR is suspected of causing 1 new CI failure in build #61308:

  • Kernels FusedMoE Layer Test (2 H100s): test_moe_layer[False-deepep_low_latency-2-1-True] crashed with SIGABRT

The PR modified tests/kernels/moe/test_moe_layer.py and multiple MoE layer/runner files (fused_moe/layer.py, fused_moe/config.py, fused_moe/runner/ etc.).

Original PR: #39107


Auto-generated by CI failure analyzer

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces ChunkingMoERunner, a wrapper that enables processing large MoE batches in smaller chunks to maintain lockstep across Data-Parallel ranks. It adds environment variables VLLM_MOE_DP_CHUNK_SIZE and VLLM_ENABLE_MOE_DP_CHUNK for configuration and implements a chunked_sizes context manager in DPMetadata to manage per-rank token counts. The changes also include refactoring test utilities and engine configuration to decouple scheduler defaults from parallel settings. Review feedback identifies critical bugs in the chunking logic related to zero-token scenarios, which could cause negative indexing or failed tensor copies during collective operations.
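The lockstep idea behind the chunked runner can be sketched in plain Python. This is a minimal standalone sketch, not the vLLM implementation: `run_chunked` and its parameters are hypothetical, and we assume the number of iterations is driven by the maximum token count across all DP ranks so every rank performs the same number of (possibly empty) steps.

```python
def run_chunked(tokens, chunk_size, max_tokens_across_dp, run_chunk):
    """Process `tokens` in fixed-size chunks.

    Every DP rank computes the same chunk count from the *global* maximum
    token count, so all ranks enter each collective the same number of
    times even when their local batch is shorter (or empty).
    """
    n_chunks = max(1, -(-max_tokens_across_dp // chunk_size))  # ceil-div
    outputs = []
    for i in range(n_chunks):
        start = i * chunk_size
        end = min(start + chunk_size, len(tokens))
        chunk = tokens[start:end]  # may be empty on short ranks
        outputs.append(run_chunk(chunk))
    return outputs
```

A rank with no tokens still iterates `n_chunks` times over empty slices, which is exactly the lockstep property the review comments below are guarding.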

Comment on lines +146 to +148
assert out_slice.size(0) >= slice_size
out_slice = out_slice[:slice_size, :]
out_slice.copy_(orig_slice, non_blocking=True)
Contributor


Severity: high

The copy_ operation will fail if orig_slice is empty, which occurs when num_tokens is 0 on a Data Parallel rank. In such cases, the workspace buffer should be zeroed out to provide a valid (though dummy) input for the collective kernels to maintain lockstep across all ranks.

        assert out_slice.size(0) >= slice_size
        out_slice = out_slice[:slice_size, :]
        if orig_slice.numel() > 0:
            out_slice.copy_(orig_slice, non_blocking=True)
        else:
            out_slice.zero_()
        return out_slice

Comment on lines +185 to +186
chunk_start = min(chunk_start, num_tokens - 1)
chunk_end = min(chunk_end, num_tokens)
Contributor


Severity: high

The clamping logic fails when num_tokens is 0 because num_tokens - 1 becomes -1, leading to negative indices and size mismatches during tensor copying. When num_tokens is 0, the runner should still provide a single dummy token (consistent with _compute_chunked_local_num_tokens in forward_context.py) to ensure all DP ranks participate in collective operations.

            # clamp start and end
            if num_tokens > 0:
                chunk_start = min(chunk_start, num_tokens - 1)
                chunk_end = min(chunk_end, num_tokens)
                # If the chunk is entirely padding for this rank
                if chunk_start >= chunk_end:
                    chunk_start = num_tokens - 1
                    chunk_end = num_tokens
            else:
                chunk_start = 0
                chunk_end = 1
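To see the failure mode concretely, here is a small standalone illustration (example values only) of what the original clamp does when a rank has zero tokens:

```python
# With num_tokens == 0, the original clamp yields a negative start index.
num_tokens = 0
chunk_start = min(0, num_tokens - 1)  # min(0, -1) == -1
chunk_end = min(8, num_tokens)        # min(8, 0)  == 0

# A slice like out[chunk_start:chunk_end] becomes out[-1:0] -- an empty
# slice whose shape no longer matches the chunk the kernel produced.
assert [10, 20, 30][-1:0] == []
```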


# Store outputs
# TODO(bnell): document when chunk_start >= num_tokens
if chunk_start < num_tokens:
Contributor


Severity: high

The copy-back condition should use the original chunk start index (chunk_start_) rather than the clamped chunk_start. If chunk_start_ >= num_tokens, the chunk is purely padding for this rank, and its result should not be copied back into the output tensor (which would be redundant or even crash if num_tokens is 0).

Suggested change
if chunk_start < num_tokens:
if chunk_start_ < num_tokens:
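The distinction matters because the clamped index is always in range. A quick standalone check with hypothetical values for a rank whose tokens are exhausted:

```python
# A chunk that lies entirely past this rank's tokens is pure padding.
num_tokens = 5
chunk_start_ = 8                                 # unclamped start
chunk_start = min(chunk_start_, num_tokens - 1)  # clamped to 4

# The clamped test is True even for a padding chunk, so it would copy
# padding results back; the unclamped test correctly skips the chunk.
assert chunk_start < num_tokens
assert not (chunk_start_ < num_tokens)
```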
